125 research outputs found
A Novel Windowing Technique for Efficient Computation of MFCC for Speaker Recognition
In this paper, we propose a novel family of windowing techniques to compute
Mel Frequency Cepstral Coefficients (MFCCs) for automatic speaker recognition
from speech. The proposed method is based on a fundamental property of the
discrete time Fourier transform (DTFT) related to differentiation in the
frequency domain. A classical windowing scheme such as the Hamming window is
modified to obtain derivatives of discrete time Fourier transform coefficients.
It has been mathematically shown that the slope and phase of the power spectrum
are inherently incorporated in the newly computed cepstrum. Speaker recognition
systems based on our proposed family of window functions are shown to attain
substantial and consistent performance improvement over the baseline single
tapered Hamming window as well as the recently proposed multitaper windowing
technique.
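The DTFT property the abstract refers to is presumably the standard one: if x[n] has DTFT X(w), then n*x[n] has DTFT j*dX/dw. The sketch below is not the authors' method; it only checks that property numerically for a Hamming-windowed frame, by comparing the transform of the ramp-weighted window against a finite-difference derivative. The function names, frame length and step size are illustrative assumptions.

```python
import cmath
import math


def hamming(N):
    """Standard Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def dtft(x, omega):
    """DTFT of a finite sequence x evaluated at angular frequency omega."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))


def dtft_derivative_via_window(frame, omega):
    """dX/dw obtained by weighting the Hamming window with n (assumed scheme)."""
    w = hamming(len(frame))
    ramped = [n * w[n] * frame[n] for n in range(len(frame))]
    # n*x[n] <-> j*dX/dw, hence dX/dw = -j * DTFT{n*x[n]}
    return -1j * dtft(ramped, omega)


def dtft_derivative_numeric(frame, omega, h=1e-6):
    """Central finite-difference estimate of dX/dw for the windowed frame."""
    w = hamming(len(frame))
    xw = [a * b for a, b in zip(w, frame)]
    return (dtft(xw, omega + h) - dtft(xw, omega - h)) / (2 * h)
```

Both functions should agree to within finite-difference error, which is what makes a modified window a cheap way to obtain spectral derivatives without differentiating numerically.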
Learnable MFCCs for Speaker Verification
We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend
architecture for deep neural network (DNN) based automatic speaker
verification. Our architecture retains the simplicity and interpretability of
MFCC-based features while allowing the model to be adapted to data flexibly. In
practice, we formulate data-driven versions of the four linear transforms of a
standard MFCC extractor -- windowing, discrete Fourier transform (DFT), mel
filterbank and discrete cosine transform (DCT). The reported results reach up to
6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error
rate (EER) over static MFCCs, without additional tuning effort.
Comment: Accepted to ISCAS 202
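For orientation, the four linear transforms of a static MFCC extractor can be written as explicit matrix products, with a power and a log step in between. This is a minimal pure-Python sketch of a conventional fixed extractor, not the learnable frontend from the paper; a data-driven version would presumably initialise its parameters from matrices like these and then train them. The sample rate, filter count and cepstral order are assumed values.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def hamming(N):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def dft_matrix(N):
    """Rows are the non-negative-frequency DFT basis vectors."""
    return [[cmath.exp(-2j * math.pi * k * n / N) for n in range(N)]
            for k in range(N // 2 + 1)]


def mel_filterbank(n_filters, N, sample_rate):
    """Triangular filters with centres equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    hz = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [k * sample_rate / N for k in range(N // 2 + 1)]
    return [[max(0.0, min((f - hz[i - 1]) / (hz[i] - hz[i - 1]),
                          (hz[i + 1] - f) / (hz[i + 1] - hz[i])))
             for f in bins]
            for i in range(1, n_filters + 1)]


def dct_matrix(n_ceps, n_filters):
    """Unnormalised DCT-II basis."""
    return [[math.cos(math.pi * c * (m + 0.5) / n_filters)
             for m in range(n_filters)] for c in range(n_ceps)]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def mfcc(frame, sample_rate=16000, n_filters=20, n_ceps=13):
    N = len(frame)
    windowed = [w * x for w, x in zip(hamming(N), frame)]   # 1: windowing
    spectrum = matvec(dft_matrix(N), windowed)              # 2: DFT
    power = [abs(s) ** 2 for s in spectrum]                 # nonlinearity
    mel = matvec(mel_filterbank(n_filters, N, sample_rate), power)  # 3: mel
    log_mel = [math.log(e + 1e-10) for e in mel]            # nonlinearity
    return matvec(dct_matrix(n_ceps, n_filters), log_mel)   # 4: DCT
```

Calling `mfcc` on a single 256-sample frame returns 13 coefficients under these assumed settings.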
Optimization of data-driven filterbank for automatic speaker verification
Most speech processing applications use triangular filters spaced on the
mel scale for feature extraction. In this paper, we propose a new data-driven
filter design method which optimizes filter parameters from given speech
data. First, we introduce a frame-selection based approach for developing a
speech-signal-based frequency warping scale. Then, we propose a new method for
computing the filter frequency responses by using principal component analysis
(PCA). The main advantage of the proposed method over the recently introduced
deep learning based methods is that it requires only a very limited amount of
unlabeled speech data. We demonstrate that the proposed filterbank has more
speaker discriminative power than the commonly used mel filterbank as well as
existing data-driven filterbanks. We conduct automatic speaker verification
(ASV) experiments with different corpora using various classifier back-ends. We
show that the acoustic features created with the proposed filterbank are better
than existing mel-frequency cepstral coefficients (MFCCs) and
speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In
the experiments with VoxCeleb1 and the popular i-vector back-end, we observe a
9.75% relative improvement in equal error rate (EER) over MFCCs. Similarly, the
relative improvement is 4.43% with the recently introduced x-vector system. We
obtain further improvement by fusing the proposed method with the standard
MFCC-based approach.
Comment: Published in Digital Signal Processing journal (Elsevier)
Quality Measures for Speaker Verification with Short Utterances
The performance of automatic speaker verification (ASV) systems degrades as
the amount of speech used for enrollment and verification is reduced.
Combining multiple systems based on different features and
classifiers considerably reduces the speaker verification error rate with short
utterances. This work attempts to incorporate supplementary information during
the system combination process. We use the quality of the estimated model
parameters as supplementary information. We introduce a class of novel quality
measures formulated using the zero-order sufficient statistics used during the
i-vector extraction process. We have used the proposed quality measures as side
information for combining ASV systems based on Gaussian mixture model-universal
background model (GMM-UBM) and i-vector. The proposed methods demonstrate
considerable improvement in speaker recognition performance on NIST SRE
corpora, especially in short duration conditions. We have also observed
improvement over existing systems based on different duration-based quality
measures.
Comment: Accepted for publication in Digital Signal Processing: A Review
Journal
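The abstract does not define the quality measures themselves, so the sketch below only illustrates the quantity they are built on: zero-order sufficient statistics, i.e. the per-component sums of frame posteriors under a GMM. It assumes a one-dimensional GMM for brevity; the function names and the toy model are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def zero_order_stats(frames, weights, means, variances):
    """Sum of component posteriors over all frames, one value per component."""
    counts = [0.0] * len(weights)
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for c, like in enumerate(likes):
            counts[c] += like / total  # posterior of component c for this frame
    return counts
```

The statistics sum to the number of frames, so a short utterance spreads very little mass across the components, which is one plausible reason they carry information about how reliably a model was estimated.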